processing such data have been developed - m on. , • 

one examines hybridization between a til* J , a PP^ation, 

zrz r.:-rr- * ~: zlz 

the identification o£ genetic va^aZTLe™ e L iDClUde 
sue as BXV or genetic diseases , such m - - eases. 

Other ways of obtaining useful information from
hybridization data would be of benefit to the scientific and
medical communities.

SUMMARY OF THE INVENTION

The present invention involves a hierarchical method
of array-based analysis in which single nucleotide base
determination may or may not be one step. The present
invention has several

.termination of a ll^Z^T ^ 
signatures include polynucleoti rt« ^ y Ce 

^natures such ae'thol ^"^TSS ~ 
families, different qenes In * „„. 3 
„ , ... genes in a genome, repeat sequences 

polymorphic forms of a gene. The methods involved " ' 
hybridization assays between the target nucleic Ju ml f 
to be screened and polynucleotide arrays designed o ?£T 

contain ~ ~ ^ ~ "~~ ^r^ 
«- ^nce of the ^^^T- t ^r'"' ^ 
that sequence. Thereby, the probes define the
sequence signature and sequences related to the sequence
signature. A hybridization assay between the target molecule
and the probes in the array generates data about which probes
the target has hybridized to, as well as the extent

^:iz\i^ittj- a rr c= * — 

target has hybridLed'co ^ efinl te^r^r 
seances, or to probes defining seances that IZ^IT 



A 3 Wrher the target 

W098 "" n ces one can one or -re of 

reference sequences, e „J»ilax c re£e rence 

tBe ss-ae sequence or a ing ^^^ne 

has the Reouences- ^„w eS , one can a« 

the reference se^ ^ as pr°^- relateo 

fences »^ . particular c ^ o£ . 

„ h etner a « r9 * nce signature, or " or closely 

^^natCe "^soToo: - patterns of 

"tin t- ° n ; e ran; Terence serene- ^ ^ 

gene xn ^_ wpen target an faitixlxes, 

differences between of gene differe nces 

nnvel gene famxlxes, simi iaritxes ana/ algQ 

Z By identify^ the seque nces, ° ne nucle ic 

UKe ' t he reference and targ me of a targe .. 

between the 4t . tan on the cnr« 

polynucleotide P n data by per ule and the 

generating »^f \ . targ et nuclexc »~ betw een the 

20 -action fc*^^ meeting pro.es in the 

probes in the se e aXi d each det ermxne 

target nucleic acxd m idiza tion data g nC e 

ST- -* ^rse^nce signature is 

aether the target^ ^o^J^ nC& signature 
signature. » signature, tn roC essxng 1 9 

«*odi*ent. « array course* set signatures 

* n « signature, tn nucl eotrde sequ lon , or 

30 9 d Xe tf» a^"^ sequence »^s e — 1 P- b e sets. 
en cooin 9 tne ^ ' or natures of *. 

as sn alternate to 9 generic bas es ^ ca 

35 C. T. G "."fences that qu«* 
second assa y . The Iirsc a q asay T B ZZT t T Bt 3 " rSt ^ 
determination of the nr a « selected to provide a 

to provide a damnation of the presence or 1' 
variant of a second sequence signature It Z . " 
assays employe a high-density n!c le Ic acid a °"* ° f 

analyses the nucleic acid simple ^Lg the f^t °" 
nucleic acid sample is optionally analyst in a ""^ 
depending upon the results of the first assay * "~ * 

In a further embodiment, the first „ T 
sequence signature is a conserved regLn o . ^ 
certain embodiments, the first n . 9<2ne ^"X- la 

a non-conserved region of a l n f si ^ature is 

additionally comprL^e^^ Ll^ngT" " 
said nucleic acid sample. 5 sequence of 

selecting c^TrTa £ 3 — « 

provides a support havi^"^' ° £ * 
it. The support is exnoLn ^ ° DeB associated with 

under low. medium oT^Tstl^ " — "^^otides 
least some hybridization between - P— tt at 

polynucleotides. One identrf^ T 

with the polynucleotides cf ""^ h ^i°i« 

-se not iae.elfieirhybr^rr::: 

h igh-density n 2:^^1:1™° method - the °— - - 

acid sa^le^L ™ s iE ^ * * 
containing nucleic aclcs ^ *" * S -Pla 

contains a sequence 

acid array; and further analysi" L! hl ^- denSlt ^ nucl *ic 

if that sequence signature Js not present " °^ 

This invention also provides a method for
determining whether a target molecule has a sequence from a
gene family member. The method involves providing an
W • ■«« for each of at least two 

polynucleotide array comprising polynucle otide probes 

different gene family —■^^^J*. fr om the gene 
that define a reference ^ ^ performing a 

hybridization reactio detecting 
mol ecule ana the prober , in th ~» ^ molecule end 

hybridization between the target the 

each of the probes in the sets a P 

hybridization data to deterge wnethe 

acid h as the referen ce sequence fro, ^ ^ 

■nepers. In one ^^"'^nlng whether the target 
nncleic acid molecu 1- ^ridizes to- a gene 

hybridizes to a nucleic acid p embodiments, the 

encoding the gene family « S . prosraIOTa ble digital 
sC ep of processing is cerises, for each 

computer; the polynucleotide array ^ ^ h . ghly 

o£ th e gene family members, a pro de£ ining a highly 

conserved region of the gene and P ^ ^ £urthe r 

variable region of the gene, th p y seCS 
comprises, for each of th< ■ ^ ^ gene ^ 

defining at least two highly varial)le regio ns of 

probe sets defining ^^J^ acid sequence and the 

the gene; the — * de£inln9 ^ 

array further comprises pr the ^no ac id 

of the region of a 3 -'f^ clon provides a computer 

in another aspect. "»\ a ata comprising 

program product for o£ an polynucleotide 

oode that receive, , as input t^JLti-. array; code that 
probe in each feature of P sequ ences from a 

receives as input reference n identifies a 
between a target nucleic acid molecule and polynucleotide
probes in the polynucleotide array; code that processes the
hybridization data to determine whether the target nucleic
acid molecule has a sequence from any of the reference
sequences; and a computer readable medium that stores the
codes.

In another aspect this invention provides a method
that involves determining whether a target nucleic acid
molecule comprises a sequence from one of a set of genes. The
method comprises providing a target nucleic acid molecule
comprising nucleotide sequences from genomic DNA; providing a
polynucleotide array comprising, for each gene in the set,
polynucleotide probes that define at least one sequence
signature from a unique region of the gene; generating
hybridization data by performing a hybridization reaction
between the target nucleic acid molecule and the probes in the
sets and detecting hybridization between the target nucleic
acid molecule and each of the probes in the sets; and
processing the hybridization data to determine whether the
target nucleic acid comprises a sequence from the unique
region of one of the genes. In one embodiment, the step of
processing is performed by a programmable digital computer.
In another embodiment, the unique region of the gene codes for
an amino acid sequence. In a further embodiment, the
polynucleotide array further comprises, for each of the unique
regions, a set of polynucleotide probes whose sequences define
the degenerate set of nucleotide sequences that encode the
amino acid sequence. The probes in such embodiments can in
addition or as an alternative comprise sequences that contain
generic bases such as inosine at the third codon
position. As an even further addition or alternative,
polynucleotide probes can have a mixture of A, C, T and G at
the third codon position within a single feature of a
polynucleotide array.

In another aspect, this invention provides a
computer program product for analyzing hybridization data
comprising code that receives as input the sequence of a
polynucleotide probe in each feature of a polynucleotide
• 7 ^Pures from a 

input sequence f ieS a 

code that receives as x«P* code that iden- ^ 

array; code plurality of 9 having P* OD 

unique region of a P c ieotxde * J input hybridxzatxon 

r of features ^ the P rece ives as P nucleic 

define the sequence; code bet .een •J^^ cl . ot «. 
Z* <*- » P ^atU -ta to decer.ine 

n:r:u: Jesses 

array; code aCld » & compute r rea 

W he,ner the signages. 

fr0m an Ihtt scores the codes. 

a „d to execute sort ntio n. ^xgn ee n 5, 

US bv the- present xn moni tor 3. scr one D r 

ner ated by includes a ^ ^ have 

colter syst- ^ ^ e ^ ^ Cabin et 7 ^ ^ 

cabinet 7, W mou8e bu ttons 1 ^ be utxl 

incorporating the pr stot age me ^ 

readable b me mory, tape, components ^no 

. ~ c flash tape . n 4 3r computer 

"rnec 7 also houses ^Irar che lilce . r 

oh as a prooessor. — « «^ bloolc "~~£J~**>J» 
-* F ig • XB b-« S warc tba t can he use, ^ 

„ 1 used to execute so ion . M xn » contp uter 

syS «ted W the P« sent , 3 and keyboard 9 • 

data generated y Bonxtor centra l 

coop uter system - - ^ u =h ^ _ la y 

1 Tof system memory ^. xjo«^ li6 , ne twor* 

^ OCeB9 °U removable die* «*■ ai8 * XX, - 

adapter X08. M 120 * e ^a lx*e 

interface XX8. ^ computer £iash memory. 

r epreeentatxve of reB>ovable ^ ive ot an internal 

tlOW rU- ^ diSk i9 0th« computer systems 
suitable for use wi>h 

. additional atTJ^Zj™"* ~ i0 " ~* ™™* 

tewer subsystems. F 0r exarnnl ^ 

system could include more than „ no another computer 

encompasses a degenerate se t of eTnucleotiai " N ° :2> ' 
encode le . one of these is ATTGGCAAAG CTATG (SEO U Tn CeS thaC 
probe set of 6-mers based ™ , 0 1D N0:1 >- * 

10 defines this re^^H ^ 

1° NO : 1 ) , Tl^CA (2 . 7 of ~ » - ~ ,1-6 of SE0 

NO:l). GGCAAA (4-9 of SEO ID »o V, ' 3 " 8 ° f SE ° ID 

».»). CAAAGC of Z, In » >' GCftAAG <5 - 10 ° £ SEQ ID 

NO:!,. AAGCTA (S-13 of SEQ ID Z '/ ^ "'^ °* ^ ID 

» and ™ Cl0 - * r^rir 4 ° f se ° id 

sequence within the d M o« . ' Another reference 

N0=_, . A probe see ~ ; ^ «« 0 » 

^ing that defines this S^-fT"" 
NO:_), ACGGAA (2-7 of SEO Tn „n ° f SEQ ID 

— ■ of Lq S ; ,"" 1 8 o of f wo ID 

). AAAGGP (c ^ ' ^"AAGG (5-10 of SEQ ID 

AAAGGC (6-11 of SEQ ID NO: ) AAGGra 

AGGCAA (8-13 of SEO rn ^ ~ ' ° f SE Q 



NO 
NO 
NO 



AGGCAA (8-13 of SEQ ID NO- ) GGPajlT ' ID 
_>, and GCAATG , 10 . 15 of " l^l' {9 " 14 ° f SE ^ => 



(10-15 Of SEQ ID NO: ) 

- de t eo Cin3 :^:s d oTr: e n a e n = e - a *« 

genes in a faaaiy are selecteT • re9i ° DS ° f three 

brackets) . The nucleotide aiSn " Ure ae *>°™° «» 

S enes i. 2 and 3 ^^^1^" ~ " 
A. B. c, D (variabie region v,, TV 7% , Pr0b6S 

° v.» . I. K. L .constant region C,', and „ N o r69i ° n 
region C 2 ) ' ' °» p (constant 



Fig. 4 depicts an example of a stratum, -f 
detecting sconce signatures frl . variety o^ ln 
this example, an polynucleotide array having 525 " L, 
features is provided that contains probes vLh all • 
ner sequences. Two polypentid* ° *" P° s aible 9- 

checked. AsH-Cly.^^ TZTZ' ™ 
Ser-Ehe «a site recognised by protein . ^£ 
WO 98/U354 ^ enceg AT TGGCAAAG CTfcTG ^ 

^eotide reference , ar e del i 

nucleoli gttTT (SEQ * programs 

CGCCGCGGAA Gi aoeCti veiy- Tne * _ e locatxon on 
ana respect .^t-ifies tne * indicated as 

le tters °« tM use s data t«- «*• 

o£ each of thirty-t*> Tne first «f* B . of the 

-* ^ 7Z . ^.V. - SiMle " ach 

Mt «een «•* fences descried io I^y-nine "~ 

increment and clon es. 

o{ tbe thirty avwtic" 

> T «3> BBSCalPTIOM OP 

DBtM . „ terns have the 
- ^^"^u^ea herein. 

tol Xo«in g -aning- . cowpleI0ent ary" surfaced o 

** Zoning -ogecner^ ^ target its 

„« a tibi lity or ^ < tB target. _ d £ u rtnerttvore . 

-r-^ras ---- ^ r r g u 

^ trace characterises ar ^ary such 

^t-ntary -in.es has 
otner- conv 
complementary to c and A ±fl complemen . 

genetic code. Complementary al So Zr-TT ^ or u i„ the 

ligano-receptor (also ^o^ll^Zl ^ °* 
interactions, sue, as between 0 ^ " ^ ™ 
■ their agonists, antagonists and rec *Ptors and 

thereto or show -cj^i^/^^ «« hind 

™eot^ 

particular target d«J^- recognized hy a 

~*.yec depending on conf^Y^ , 

refers both to individual polynucleotide " " Pr °" e " 
collection of same-sequence JT T "-olecules and to the 
innnobilized at a ^Z l^lZ TT "««»- 
often used interchanged v * 6 tar 9« »« 

can bind or llT^Tl ZlT * 

ligand-anti-ligand pair The I"* 9 33 Pa " ° f a 

Present invention cL ^ITjlTLT. abound" 
analogs thereof. found xn nature or 

The term "target" reff>r Q 
The probe is useful in ^ > • 3 molecule of interest, 

whether the target — - target: 

"ay be naturally-occurring or ^ T ^ Pr ° be - Tar9ets 

Also, they can be employe Z 21 ""^ -a*"*-. 

a 9 gregates „ith other S ^u". ^"1^" ~ ~ 
covalently or noncovalentlv to » k ! associated, 

directly or via a specif i bind • 9 me0,ber ' elther 

bmdxng substance Tar^^ 

^HLTrr^rtr as -~ T "-- 

-ough secular reco^o™^^ "~ ~ 

-fining AS^^C^T; T ' ^ ~ 
complementary to that of the caroet h partially 
***e the co^arison of target ald ref " ' *" " - * r t0 

the sequence of a target or ™" 3 ™7t VT" ^ 
the probes that interrogate it Z V ^rence and 

complement. 9 " be Siven h «ein as their 

The term "feature- refers rr, =,„ 

having a collection o, substantial!^ °* * SUbs "«* 

^obili,ed polynucleotide P^'a^T^' 

wneraiiy, one feature is 
WO W tWS P* obeS sequences- 

renc U>an anotne, . £ a tr; di «e,en t °* 
di££ eren<- su£>st antxallY aesi9 ned to ^ 

** atUreS testes. tne 3 P 0 "" 0 " ° ucXeotXde 

cartas ' eaC exarople to ^^-directe* P« £ 8S able 
8equ ences tor £ x^ ^ ^ aXXy ^ 

sequence. * le , a teat 5.384.261. 

sy «nesis. *- 0 . s , Patent « w/9S/1 tf.»- 

syn tnesis „ 5 .« 4 ;; 3 ; t ' ide array" "^ent . Known 

en— Tt^r'te. - 
ee^«f ^cxeot^ ^ tive ^redj ^ a 

surface- * denS ity of at arrays c 

£ eatnres and a *e ^ 0 dx»ents xeae ^ ^ 

°° £ aSnt 625. at ^ ^ lea st one -1 ^ can 

— itY d° af X-et xoo squ are - ^ can .ave 

inoneand. a £eat u«»°» or 9^ ss ss cW e< 

lea8t «xT>y -V — ^^eop. ^ lde ~ * \„ useful 
to merely * glass mxcx" t t o li9 n ^ . nVO ives 

1* substrate to w ^. conta xn 

cna.acteriet.ce. ^ ^ a „ tur^ ^ ^ a ct« _ ^ 
ot material t ^ be a fu , on e can 

tttil «ed to s« £y . tot se e it » 

tnat one m-"-"-- Actional assay to ^ 

0t ^las ant^odv ^J^ica! P^"" ° oe ll 

activity ° r ac - cne* 1051 re sonance, 

eIlZ yme acti ion aS9 ays, tron spi» res useful 

spectroscopy. <^ 89iog# ele ctrop*or 
isoelectric fo 
screening assays include combinations of these methods
performed sequentially or at the same time. It will be
apparent to those gkiiUH a* 

and western blott- ! that N °" h «™. Southern, 

e^^r^tr..;: it: r*~ - - 

takes place before another step. The terms
screen and prescreen can at times be used interchangeably.

The term "sequence" refers, depending on context, to
the nucleotide (base) sequence of a nucleic acid or the amino
acid sequence of a polypeptide.

The term "nucleic acid sequence signature" refers to 
a chosen or reference nucleotide sequence. Sequence 

o. /5. 50, 30, 25 or at most 15 nucleotides in lenoth 
Sequence signatures include sequences less than 10, 15, 25,
30, 35, 45, 50, 60, 70, 80, 90, 100, 120, 135, 150, 175, 200,
250 and 300 nucleotides long. Sequence signatures also
include any combination of these parameters. Nonlimiting
examples of nucleic acid sequence signatures occurring in
nature include, e.g., the Hogness box, the TATA box, a
homeobox, the CAAT box and Alu repeat sequences.

an amino J£ llZZl^^T^ 

are selected froTche qrouTof twenty 1 ^/ 69 ^ 0 ' 

_ ^ «i twenty common ammo acids and 

a amLTac-r ^ ^ "» ""action "f 

all amino acid sequences defined by the sequence signature is
referred to as the polypeptide "signature set."

sequences SV^'Z^^T — ° 

r - -*» acids :ur st 

acids, or analogues thereof) is f™ ri a _ 

v. v. rixed. A sequence signature 

can be chosen to be fixed or variable. Sequence signatures 

acids' ^ " Iin0 SeqUenCM ln " hiCh the °* ^> 

acrds that can occupy a variable position is selected from at 

most 15, at most 10. at most 5 or at most 2 of the twenty 
common amino acids other a™,- „ . ^ * 

Sl utner amino acids, including those 

to those silled in the biochemica! arts as the 
amrno acids, are also included. Polypeptide sequence 
being mos>>- 

WO,8 " BM ■ Tuae ami- acid « " »' 

signatures also include aC „,ost 50 a signat ures 

-, t n 200. 1«' 15 °' .„ polypeptide 3 ^ 120, 100. 

300, 250. 2 in lengt h. »' 180 , 140. 12 

m ost 5 amn° aci chan 2 i5, lengCn . 

a lso i-luda -^ en 0 Ce 20 , l0 , or 5 amino any combination 

90 800.70,50.40.30.2 also incl ce 

„,ids sequence sign p olypep« ae structural 

P olypepcia e Examples or " a other stru 

o£ these P^"! ' zi nc £lnger motif heliceg , 
signatures * limitation coi"; *^ 

JUs including ^ 8 ^tric ^ ^ sequ ence for 
turns, leucme siPP cons ensus r 9 loljull0 s . for 

^rn-str^or ^ iSng steroid- 

hor »one recep ^ t^ ^ exons e ocod 

P° y * The term reg iguQ us nUC * nce of at least 

--^rte^rU g^s > —P- 

r«,cieotide seqw amino acxa 

20 Ttino acids »itnm the 

s-oded W • . unique region- re.ers 
encoaeu » u nique - 

e by tvo genes. . collection of 

that is nor snared *y refa rs t^ ^ ^ 

The term 9 polypeptide iaenC ity over a 

genes encoding at leas ^ ^ ^ ^ dofflaln8 
Lid seances nave ^ 20 ^no ac ^ gene 

comparison «indo» ancestry « peptide 

are related thr°ug evolution^ » t he «W 

duplication and*-- ^ ^ lnCludl ng . * tin type XII 

S rsTno^o^ulin doma-. ^ domains 

domain, the ' . like domain, o 90 _ p p.38* 

ao main. the cadh«i ^ p ^ Ce » ^ ^ ceU , 3rd 

(D B»to name a f. y> ^ec-lar ^ alao 

B in more detaxl xn R 

discussed xn 64:28 7-3l4- 
Biochem., < l995) 
Gene families frequently encode polypeptides sharing
at least one highly conserved region. Two polypeptides share
a "highly conserved region" if the polypeptides have a
sequence identity of at least 60% over a comparison window of
five amino acids, or if they share a sequence identity of at
least 80% over a comparison window of ten amino acids.
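
The two identity thresholds in this definition can be checked mechanically. The following is a minimal Python sketch, assuming the two polypeptides have already been aligned without gaps and are supplied as equal-length one-letter strings. The function names and the example peptides are hypothetical and are not part of the disclosure.

```python
def identical_residues(a: str, b: str, start: int, width: int) -> int:
    """Count positions in the window [start, start + width) where a and b agree."""
    window_a = a[start:start + width]
    window_b = b[start:start + width]
    return sum(1 for x, y in zip(window_a, window_b) if x == y)


def share_highly_conserved_region(a: str, b: str) -> bool:
    """Return True if a and b meet either threshold of the definition:
    at least 60% identity over a five-residue comparison window, or
    at least 80% identity over a ten-residue comparison window.

    Assumption: a and b are pre-aligned, gap-free and of equal length.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    for width, percent in ((5, 60), (10, 80)):
        # Slide the comparison window across every possible start position.
        for start in range(len(a) - width + 1):
            # Integer arithmetic avoids floating-point edge cases at the threshold.
            if identical_residues(a, b, start, width) * 100 >= percent * width:
                return True
    return False


# Hypothetical peptides: three of five residues agree, i.e. exactly 60%.
print(share_highly_conserved_region("MKTAY", "MKTGG"))
```

Sliding the window over every start position is one possible reading of "over a comparison window"; a stricter reading that fixes the window position in advance would need only a single call to `identical_residues`.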

Polypeptide members encoded by a gene family, the
protein family, can have highly variable regions. A "highly
variable region" of a polypeptide encoded by a gene family
member is a region of ten amino acids that has less than 50%
sequence identity with the same region of a polypeptide of
another gene family member. Protein families that can be
interrogated using the present invention include the TNF
family, the BCL-2 family, actins, the heat shock proteins,
keratins, myosin, protein kinases, transcription factors,
tubulins, egg shell proteins, alpha globin, beta-like globins,
immunoglobulins, ovalbumin, transplantation antigens, visual
pigment protein, and vitellogenin as non-limiting examples.
See Vaux, D.L., Cell, Vol. 90, pp. 389-390 (1997);
Molecular Biology of the Cell, 3rd Ed., Alberts et
al. (1994); Avise, J.C., Molecular Markers, Natural History
and Evolution, Chapman and Hall publishers (1994); Stryer, L.,
Biochemistry, 3rd Ed. (1988); and Atassi, M.Z., Molecular
Immunology, Marcel Dekker, Inc. (1984).
"Pseudogenes" are genomic regions that do not result
in protein products in the organisms that contain them.
Pseudogenes have sequence similarities to their true gene
counterparts. Pseudogenes may arise from duplication of
ancestral genes, except that mutations contained in the
pseudogene interfere with transcription or translation. Lodish
et al., Molecular Cell Biology, 3rd Ed., Scientific American,
Inc., New York, New York (1995). As used herein, pseudogenes
can be members of gene and protein families that contain their
functional counterparts.

"Tandem repeat genes" or "tandemly repeated genes"
encode identical or nearly identical proteins or functional
RNAs. The copies can appear one after the other separated by
spacer regions that can vary within an individual. Lodish et
^'.r^r ; seq id no:4> > is ° — - 
co sel e CC rtr 1 ^ sr; criceria used 

reference sequence or set
of reference sequences.

"Block tiling" generally refers to a tiling strategy
including a set of probes defining a reference sequence in
which none of the probes in the set overlap in sequence. For
example, the reference sequence ATTGGCAAAG CTATG (SEQ ID
NO:1) can be block tiled by the set ATTGG (1-5 of SEQ ID
NO:1), CAAAG (6-10 of SEQ ID NO:1) and CTATG (11-15 of SEQ ID
NO:1).
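
Block tiling as defined above amounts to cutting the reference sequence into consecutive, non-overlapping probes. The following Python sketch illustrates this for SEQ ID NO:1. The function name `block_tiling` and the 1-based position tuples are illustrative choices, not part of the disclosure.

```python
def block_tiling(reference: str, probe_length: int) -> list[tuple[str, int, int]]:
    """Split the reference into consecutive, non-overlapping probes.

    Returns (probe, first_position, last_position) tuples using the
    1-based positions found in the text.
    """
    probes = []
    for start in range(0, len(reference), probe_length):
        probe = reference[start:start + probe_length]
        probes.append((probe, start + 1, start + len(probe)))
    return probes


SEQ_ID_NO_1 = "ATTGGCAAAGCTATG"
for probe, first, last in block_tiling(SEQ_ID_NO_1, 5):
    print(f"{probe} ({first}-{last} of SEQ ID NO:1)")
```

With a probe length of five, the sketch yields the three probes ATTGG, CAAAG and CTATG listed in the example.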

"Single-increment tiling" refers to a tiling
strategy including a set of probes that defines a reference
sequence in which each probe in the set overlaps in sequence
with another probe in the set except for a terminal
nucleotide. For example, the reference sequence ATTGGCAAAG
CTATG (SEQ ID NO:1) can be single-increment tiled by the set
ATTGGC (1-6 of SEQ ID NO:1), TTGGCA (2-7 of SEQ ID NO:1),
TGGCAA (3-8 of SEQ ID NO:1), GGCAAA (4-9 of SEQ ID NO:1),
GCAAAG (5-10 of SEQ ID NO:1), CAAAGC (6-11 of SEQ ID NO:1),
AAAGCT (7-12 of SEQ ID NO:1), AAGCTA (8-13 of SEQ ID NO:1),
AGCTAT (9-14 of SEQ ID NO:1) and GCTATG (10-15 of SEQ ID
NO:1).

"Double-increment tiling" refers to a tiling
strategy including a set of probes that defines a reference
sequence in which each probe in the set overlaps in sequence
with another probe in the set except for two consecutive
terminal nucleotides. For example, the reference sequence
ATTGGCAAAG CTATG (SEQ ID NO:1) can be double-increment tiled
by the set ATTGGC (1-6 of SEQ ID NO:1), TGGCAA (3-8 of SEQ ID
NO:1), GCAAAG (5-10 of SEQ ID NO:1), AAAGCT (7-12 of SEQ ID
NO:1), AGCTAT (9-14 of SEQ ID NO:1) and CTATG (11-15 of SEQ ID
NO:1).
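
Single-increment and double-increment tiling differ only in how far the probe window advances between probes. The following Python sketch treats the increment as a parameter. The function name `increment_tiling` is hypothetical. The choice to emit a final, shorter probe when the window runs past the end of the reference is an assumption made to reproduce the last probe (CTATG, 11-15) of the double-increment example.

```python
def increment_tiling(reference: str, probe_length: int,
                     increment: int) -> list[tuple[str, int, int]]:
    """Tile the reference with overlapping probes.

    The window advances by `increment` nucleotides per probe
    (1 = single-increment, 2 = double-increment). Positions are 1-based.
    Assumption: a final probe that would run past the end of the
    reference is truncated rather than dropped.
    """
    probes = []
    for start in range(0, len(reference), increment):
        probe = reference[start:start + probe_length]
        probes.append((probe, start + 1, start + len(probe)))
        if start + probe_length >= len(reference):
            break  # the end of the reference has been covered
    return probes


SEQ_ID_NO_1 = "ATTGGCAAAGCTATG"
single = increment_tiling(SEQ_ID_NO_1, probe_length=6, increment=1)
double = increment_tiling(SEQ_ID_NO_1, probe_length=6, increment=2)
print([probe for probe, _, _ in single])
print([probe for probe, _, _ in double])
```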

"Standard tiling" refers to a tiling strategy for a
sub-sequence of a reference sequence. Standard tiling
includes a set of probes as follows. All nucleotide positions
in the sub-sequence are designated fixed, except for one,
which is designated variable. One probe in the set has (or
tiling. For example, a probe array can contain probes
defining all possible polynucleotide 9-mers. The computer can
carry in its memory the location of the feature containing the
probe having any given 9-mer sequence. Then, relying on the
reference sequence, the computer can identify the locations of
all the probes that make up, for example, the single-increment
tiling set for the reference sequence. Similarly, the
computer can identify the location of all the probes making up
the standard tiling set for each of the probes defining the
reference sequence. Then, in processing hybridization data,
the computer can be programmed to examine hybridization
between target and probe at each of the feature locations
defining the single-increment, standard tiling set.
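
One way a computer could carry the location of every 9-mer feature in memory is to compute it from the probe sequence itself. The following Python sketch assumes a purely hypothetical layout: each 9-mer is read as a base-4 number with A, C, G and T as the digits 0 to 3, and the resulting 4^9 = 262,144 indices are placed row by row on a 512 x 512 grid. The actual arrangement of features on an array is not stated in the text, and all names here are illustrative.

```python
BASES = "ACGT"   # digit order is an assumption
K = 9            # probe length used in the example above
GRID_SIDE = 512  # 4**9 == 512 * 512; the square layout is an assumption


def feature_index(kmer: str) -> int:
    """Interpret a k-mer as a base-4 number (hypothetical addressing scheme)."""
    index = 0
    for base in kmer:
        index = index * 4 + BASES.index(base)
    return index


def feature_location(kmer: str) -> tuple[int, int]:
    """Map a 9-mer to a (row, column) position on the assumed grid."""
    return divmod(feature_index(kmer), GRID_SIDE)


def single_increment_locations(reference: str, k: int = K):
    """List each probe of the single-increment tiling set with its location."""
    return [
        (reference[i:i + k], feature_location(reference[i:i + k]))
        for i in range(len(reference) - k + 1)
    ]


for probe, location in single_increment_locations("ATTGGCAAAGCTATG"):
    print(probe, location)
```

Hybridization data would then be read only at the listed locations, so the same all-9-mer array can serve any reference sequence without being redesigned.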

The term "degenerate set" refers to the set of all 
nucleotide sequences that encode a particular polypeptide 
sequence signature. 
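
A degenerate set can be enumerated by taking every combination of synonymous codons for the residues of the polypeptide signature. The following Python sketch uses the standard genetic code written in DNA letters. The function name `degenerate_set` is hypothetical. The peptide Ile-Gly-Lys-Ala-Met is used only because SEQ ID NO:1 (ATTGGCAAAG CTATG) translates to it under the standard code.

```python
from itertools import product

# Standard genetic code, codons ordered with T, C, A, G at each position.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each one-letter amino acid (and '*' for stop) to its codons.
CODONS: dict[str, list[str]] = {}
for i, (b1, b2, b3) in enumerate(product(_BASES, repeat=3)):
    CODONS.setdefault(_AMINO[i], []).append(b1 + b2 + b3)


def degenerate_set(peptide: str) -> set[str]:
    """Return every nucleotide sequence that encodes the given peptide."""
    per_residue = [CODONS[residue] for residue in peptide]
    return {"".join(codons) for codons in product(*per_residue)}


sequences = degenerate_set("IGKAM")  # Ile-Gly-Lys-Ala-Met
print(len(sequences))                # 3 * 4 * 2 * 4 * 1 combinations
print("ATTGGCAAAGCTATG" in sequences)
```

The size of the set grows multiplicatively with codon degeneracy, which is one reason probes with generic bases or mixed bases at the third codon position can be attractive alternatives to enumerating every member.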

The term "high discrimination hybridization 
conditions" refers to hybridization conditions in which a 
single base mismatch can be determined. 

Stringency conditions useful in the practice of the
present invention are set forth in Sambrook et al., Molecular
Cloning: A Laboratory Manual, 2d Ed. (1989).

"Base calling" refers to a process involving
comparing the nucleotide sequence of a target molecule with a
reference nucleotide sequence and identifying positions at
which the nucleotide in the target molecule is different from
the nucleotide in the reference sequence. "id base calling"
refers to the process of base calling further involving
determining the identity of a nucleotide in the target
molecule that is different from the nucleotide in the same
position of the reference sequence.
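
The distinction between the two kinds of base calling can be shown with a short Python sketch. It assumes, for illustration only, that the target sequence is already available as a string of the same length as the reference. In practice the target's nucleotides would be inferred from hybridization data, which this sketch does not model. The function names and the example target are hypothetical.

```python
def base_call(reference: str, target: str) -> list[int]:
    """Return the 1-based positions at which target differs from reference."""
    if len(reference) != len(target):
        raise ValueError("reference and target must have the same length")
    return [
        position
        for position, (ref_base, tgt_base) in enumerate(zip(reference, target), start=1)
        if ref_base != tgt_base
    ]


def id_base_call(reference: str, target: str) -> list[tuple[int, str, str]]:
    """Base calling that also reports the identity of each differing base.

    Returns (position, reference_base, target_base) tuples.
    """
    return [
        (position, reference[position - 1], target[position - 1])
        for position in base_call(reference, target)
    ]


reference = "ATTGGCAAAGCTATG"   # SEQ ID NO:1
target = "ATTGGCATAGCTATG"      # hypothetical target with one substitution
print(base_call(reference, target))
print(id_base_call(reference, target))
```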

A target nucleic acid sequence is of "unknown 
genetic origin" if it has not been identified to derive from a 
known genetic locus. 

II. DESCRIPTION

Nucleic acid arrays have been used to interrogate 
single nucleotide differences between reference and target 
acid is obtained. Alternatively, this multistep process can
be carried out in a single experiment on one array having
probes directed to multiple sequence signatures. The methods
disclosed herein can also be employed using one or more arrays
in serial or parallel fashion.

The present invention relies upon the outcomes of
first determinations to make decisions or further
determinations until the desired level of information is
determined. The present invention also provides a method of
probing for the presence or absence of sequence signatures or
their variants in a binary or trinary fashion. A binary
analysis asks whether a specific sequence signature is
identified or not; a binary determination is a yes/no
determination. A trinary analysis asks whether a specific
sequence signature is present, absent, or whether a variant of
that sequence signature is present; a trinary determination is
a yes/no/variant determination. A
quaternary analysis can also ask whether a variant is absent,
and so on. The hierarchies contemplated by the present
invention include the screening of a
target to a polymer array followed by at least one other
array-based determination of interest.
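
The binary and trinary determinations can be sketched as a small classification routine. The Python example below assumes that hybridization data has already been reduced to one true/false call per probe. The decision rules (all signature probes hybridized means "yes"; otherwise any fully hybridized variant probe set means "variant") are illustrative assumptions, and every name in the code is hypothetical.

```python
from enum import Enum


class Call(Enum):
    YES = "yes"
    NO = "no"
    VARIANT = "variant"


def binary_call(signature_hits: list[bool]) -> Call:
    """Yes/no determination: is the sequence signature identified?

    Assumed rule: the signature is present only if every probe in the
    set that defines it hybridized to the target.
    """
    return Call.YES if signature_hits and all(signature_hits) else Call.NO


def trinary_call(signature_hits: list[bool],
                 variant_hits_by_set: list[list[bool]]) -> Call:
    """Yes/no/variant determination.

    Assumed rule: if the signature itself is not found, the target is
    called a variant when any probe set defining a related sequence
    hybridized completely.
    """
    if binary_call(signature_hits) is Call.YES:
        return Call.YES
    for hits in variant_hits_by_set:
        if hits and all(hits):
            return Call.VARIANT
    return Call.NO


# Hypothetical per-probe calls for one target.
print(trinary_call([True, True, False], [[True, True, True]]))
```

A subsequent array-based determination in the hierarchy could then be selected according to the returned value, for example by analyzing only those samples called "no" or "variant".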

In binary and trinary analyses in which the goal is
novel gene discovery, often the most useful information is
contained in those samples that do not contain a particular
sequence (a no in both binary and trinary analysis), or those
that contain a variant of a particular sequence (a variant in
trinary analysis). When doing gene discovery, it is of
enormous benefit to prescreen nucleic acids for those that
contain a sequence that has already been identified (a yes in

all of the samples that contain the known sequence signature
to focus further study on only those nucleic acids that do not
contain that sequence signature. When looking for new genes,
much time, labor, and money is saved by narrowing the pool. It
nucleotide level so that the novel sequence of the new family
member is obtained.

It will also be appreciated that this method is
beneficial to industries involving large scale manufacture of
polymers. In the biotechnical arts, for example, large scale
recombinant protein synthesis can result in mixtures of
recombinantly produced polypeptides. In certain cases, for
example, E. coli can insert so-called "friendly" codons under
certain fermentation conditions into some but not all of the
polypeptide product. One can test recombinant protein for the
presence of sequence signature variants as the first step or
series of steps in a decision tree. That decision tree can
involve the segregation of lots that contain the variants from
those that do not. The variant lots can be discarded or
further analyzed to the level of detail desired.

It will also be appreciated that the present
invention involves the use of an array, preferably a high
density array, in at least one of the steps of any method
taught herein. The other steps may be performed using
techniques known to those skilled in the arts. In addition,
this application discusses the invention often in terms of
nucleic acid arrays and nucleic acid analysis. Analysis of
other materials and the use of other polymer arrays, including
without limitation polypeptide and polysaccharide arrays, is
contemplated by the present invention. The hierarchy of
analysis taught herein confers several advantages. One such 
advantage is conferred by looking, for example, at a small 
segment of sequence data - the sequence of the signature or 
sets of signatures instead of the full length sequence - to 
determine what if any further analysis is desired. By taking 
this hierarchical approach, the time, labor, cost, and amount 
of materials involved in handling and manipulating sample for 
analysis can be reduced. 

The present invention not only provides this novel 
hierarchy of analysis, it further teaches that for discovery 
of previously uncharacterized molecules, the most useful 
information can be derived from analysis of those samples that 
have been shown to not contain, or segregated after a 



pC17US97/n002 

WO 98/12354 ^ft 23 

, t . i ikelihood of cofllTining 
screening step to decrease ^ tures or parts thereof. 

pre viously character ~* a sequence signature fro. 

F or example, a sample that co ^ ^ ^ contaln the 

a conserved region of a gene the dif£e rent 

uoiq ue ^ ^ y TiUv contains a novel member of 
members of that gene y {urthe r analysis of that 

the gene family. "/"aesiraMe For example, in such cases, 
sample is P» rticttlarl * chy of analysis, determination 

and at that stage in the ^"^^ o£ the region that 
o£ the full length sequence or at ^ ^ ^ ^ 

di£ fers from the unique region o ot ^ ^ 

putative gene family is q£ che £ulx le ngth sequence 

simultaneously ""^^"J ca n obtain the footprint 
ac the single ""O^ 0 "** 1 *!* See „097/29212 and EP 

or ^ code W bridi2 f/° n 2 Pa £ r i ed' October 20. 1995. published 
application no. " 30747 i 6 - 2 ; 9 £ 9 1 6 led It will be appreciated that 
aa Ep 0717113*2. June 19. 1996. ^ nucle otide 

at xeast in ^"^^TZ the 3 f ootprint or other 
determination can be rnferreo 

Ration pattern.^ ^ 

ar e particularly -ful in the ^ --^or other 
members; the discovery of new g as bein9 from or 

molecules; identify of " C ^ or otherwise,; 

containing certain regl ° M * ially hazardous materials 
ch s handling or • 
including without limxtatron segregation of 

acid material* ~ ^^groupings; epidemiological 

materials into different analysi8 of recombinantly or 

characterisation and analyses analy ^ ^ ^ 

snzymatically ^^"""^^Umitation antisense agents, 
nucleic acid (including without U-^ se5<iences , res triction 

riboZ ymes. promoter talle<J sequences, branched 

site sequences, capped "<^ 3 ' vector sequences, analogues 

5 sequences. ^^"VTacias' and other sequences or 
and protein or peptide analogues), carbohydrate of all forms
and analogues thereof, proteoglycans, and filamentous
materials including without limitation those containing
fibrins, actins, myosin, tropomyosin, troponin, and
meromyosin; quality control and assurance for manufactured
biological materials, natural or synthetic polymers, or other
chemical materials; the narrowing of a clone pool and others.

All of the methods discussed herein can include 
correlating RNA levels with gene sequences of interest, the 
identification and use of expression patterns, and the 
narrowing of expression pattern information in a hierarchical 
fashion, or the selection, including by experimental design, 
of subsets of particular expression profiles. For example, 
one can look for the sequence signatures of enzymes involved 
in a particular metabolic pathway. If one or more of the 
sequence signatures are missing, a second assay can test for the
sequence signatures of other enzymes that can or are thought
to metabolize the excess accumulation of bioproducts that 
results from the enzyme deficiency screened for in the first 
assay. 

A. Screening Methods 

1. Analyzing For Sequence Signatures 
In one aspect, this invention provides methods that
involve analyzing a nucleic acid molecule for the presence of
a sequence signature. Such analysis involves starting with a
polynucleotide array that contains a set of probes that define
the sequence signature; generating hybridization data by
performing a hybridization assay between the target and the
array and detecting hybridization between the target and the
probes in the array; and processing the hybridization data to
determine whether the target has the sequence signature.

The probes required on the polynucleotide array 
depend upon the sequence signature to be analyzed. The 
sequence signature can be, for example, an amino acid sequence 
or a nucleotide sequence. The sequence signature could 
define, for example, a polypeptide domain. The sequence 
signature could be a fixed sequence or a consensus sequence in
[...] contain generic bases such as inosine or mixtures of A, C, T,
G, or U at the equivalent of the third codon position in the
sequence.

One then carries out a hybridization reaction in 
which the target nucleic acid sequence is contacted with the 
polynucleotide probe under hybridization conditions. If the 
target nucleic acid molecule is very long, one can optionally 
break the target into fragments and contact the array with the 
fragments. Usually the target or fragments thereof are 
detectably labelled so that the positions at which they have 
hybridized can be determined. 

After carrying out the hybridization reaction, 
hybridization is detected between selected probes and the 
target to generate hybridization data. This data usually 
reflects the amount of hybridization, as determined by the 
strength of the detectable signal (fluorescence for example), 
between the target and the probes at a particular feature. 
One can use high, intermediate, or low discrimination 
hybridization conditions as desired. 

The hybridization data is then processed, preferably 
by programmable digital computer, to determine whether the 
target contains a nucleotide reference sequence defined by any 
probe set. Processing the hybridization information can 
comprise determining the degree of fidelity of hybridization 
between the target nucleic acid molecule and each probe in the 
set, whereby hybridization with high fidelity to all the 
probes in the set indicates that the target nucleic acid 
molecule has the sequence signature, and hybridization with
high fidelity to a subset of the probes in the set indicates 
that the target nucleic acid molecule has part of the sequence 
signature. 
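
The fidelity test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `classify_signature`, the fixed intensity `threshold` standing in for "high fidelity", and the sample intensities are all hypothetical and are not taken from the specification.

```python
from typing import Dict, List


def classify_signature(intensities: Dict[str, float],
                       probe_set: List[str],
                       threshold: float) -> str:
    """Classify a target against one probe set.

    Returns "full" if every probe in the set hybridizes at or above the
    threshold, "partial" if only a subset does, and "absent" otherwise.
    The threshold is an assumed stand-in for "high fidelity".
    """
    if not probe_set:
        raise ValueError("probe_set must contain at least one probe")
    hits = [p for p in probe_set if intensities.get(p, 0.0) >= threshold]
    if len(hits) == len(probe_set):
        return "full"
    if hits:
        return "partial"
    return "absent"


# Hypothetical fluorescence readings for three probes of one set.
probe_set = ["CGACGA", "GACGAG", "ACGAGG"]
signals = {"CGACGA": 950.0, "GACGAG": 910.0, "ACGAGG": 40.0}
result = classify_signature(signals, probe_set, threshold=500.0)
print(result)  # partial
```

In practice the cut-off would depend on the label, the detector, and the discrimination conditions chosen, so a single fixed threshold is a simplification.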

For example, suppose one desired to determine 
whether a target polynucleotide encoded the amino acid 
sequence RRGSV (SEQ ID NO:_). As stated above, 3456
nucleotide sequences encode this amino acid sequence. An
array can be selected that includes probe sets using a single-
increment tiling strategy defining the degenerate set of
nucleotide sequences that encodes RRGSV (SEQ ID NO:_).
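
The figure of 3456 follows from the standard genetic code: arginine and serine each have six codons, and glycine and valine each have four, so 6 x 6 x 4 x 6 x 4 = 3456. The following Python sketch checks that arithmetic. The helper names (`codons_for`, `degenerate_set_size`, `translate`) are illustrative only, and the standard codon table is assumed.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base fastest).
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def codons_for(amino_acid: str) -> list:
    """All codons that encode one amino acid (single-letter code)."""
    return sorted(c for c, aa in CODON_TABLE.items() if aa == amino_acid)


def degenerate_set_size(peptide: str) -> int:
    """Number of distinct nucleotide sequences encoding the peptide."""
    size = 1
    for amino_acid in peptide:
        size *= len(codons_for(amino_acid))
    return size


def translate(dna: str) -> str:
    """Translate an in-frame DNA sequence, ignoring a trailing partial codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


print(degenerate_set_size("RRGSV"))   # 3456
print(translate("CGACGAGGGTCTGTC"))   # RRGSV
```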



PCT/US97/17002

WO 98/12354

Suppose, further, that the target has the nucleotide
sequence CGACGAGGGTCTGTC (SEQ ID NO:_), which encodes RRGSV
(SEQ ID NO:_). Under high discrimination hybridization
conditions, this target sequence would hybridize to the
single-increment probe set as indicated by an asterisk:

Signature:   R  R  G  S  V     (SEQ ID NO:_)
Reference:  CGACGAGGGTCTGTC    (SEQ ID NO:_)
Probes:    *CGACGA
            *GACGAG
             *ACGAGG
              *CGAGGG
               *GAGGGT
                *AGGGTC
                 *GGGTCT
                  *GGTCTG
                   *GTCTGT
Signature:
Target:
Reference:
Probes:     CGCCGA
              CCGAGG
              *CGAGGG
               *GAGGGT
                *AGGGTC
                  GGGTCC
                   GGTCCG
                    GTCCGG
                     TCCGGG
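
The single-increment tiling shown in these listings can be sketched in Python as follows. This is a simplified model, not the method of the specification: a probe is treated as hybridizing only when its sequence occurs exactly in the target (an assumed stand-in for high discrimination conditions), probes are written in the same sense as the reference, as in the listings, and the second listing is assumed to use the same target as the first. The function names are hypothetical.

```python
def single_increment_probes(reference: str, length: int = 6) -> list:
    """Overlapping probes of the given length, shifted one base at a time."""
    return [reference[i:i + length]
            for i in range(len(reference) - length + 1)]


def matching_probes(target: str, probes: list) -> list:
    """Probes whose sequence occurs exactly in the target (perfect match)."""
    return [p for p in probes if p in target]


target = "CGACGAGGGTCTGTC"

# Probe set tiled from a reference identical to the target.
first_set = single_increment_probes("CGACGAGGGTCTGTC")
print(len(first_set), matching_probes(target, first_set) == first_set)

# Probe set copied from the second listing (a different reference).
second_set = ["CGCCGA", "CCGAGG", "CGAGGG", "GAGGGT", "AGGGTC",
              "GGGTCC", "GGTCCG", "GTCCGG", "TCCGGG"]
partial = matching_probes(target, second_set)
print(partial)  # ['CGAGGG', 'GAGGGT', 'AGGGTC']
```

Only the three central probes of the second set match, which agrees with the asterisks in that listing.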
The target may hybridize to part of a reference
nucleotide sequence but it may not hybridize at positions 
representing particular codons. In this case, the target does 
not encode the polypeptide sequence signature, but may encode 
a related sequence signature which varies from the original as 
a result of a variable amino acid position. If the array 
contains probes defining sequence signatures that include such 
variable positions, the computer system can process the 
hybridization data from the probe sets defining these other 
sequence signatures, to determine whether the target encodes 
one of these. If the target fails to hybridize to the probes
defining a sequence signature, then the target does not encode
this sequence signature.

The array need not include probes defining the 
degenerate set of nucleotide sequences encoding a polypeptide 
sequence signature. As an alternative to a degenerate set of 
nucleotide sequences, one can provide for generic bases such 
as inosine or mixtures of A, C, T, G, and U at what 
corresponds to the third codon position. In addition, one can 
employ footprint, molecular bar-coding, or other hybridization
patterns to determine the presence of, absence of, or variance 
from the reference sequence signature. 

In another embodiment of this method, the array 
further comprises probe sets selected for standard tiling of a 
reference sequence. Suppose, for example, that as a result of 
mutation, the target nucleic acid has the sequence CGA CGA tGG
TCT GTC (SEQ ID NO:_), which encodes RRWSV (SEQ ID NO:_). A
probe set that is standard tiled throughout the reference 
sequence may include probe sets that hybridize to the target 
as follows: 



Signature:   R  R  G  S  V     (SEQ ID NO:_)
Reference:  CGACGAGGGTCTGTC    (SEQ ID NO:_)
Target:     CGACGAtGGTCTGTC    (SEQ ID NO:_)
Probes:    *CGACGA
            CGtCGA
            CGgCGA
            CGcCGA
             GACGAG
             GAaGAG
             GAtGAG
             GAgGAG
              ACGAGG
              ACaAGG
              ACtAGG
              ACcAGG
               CGAGGG
               CGtGGG
               CGgGGG
               CGcGGG
                GAGGGT
                GAaGGT
               *GAtGGT
                GAcGGT
                 AGGGTC
                 AGaGTC
                 AGtGTC
                 AGcGTC
                  GGGTCT
                  GGaTCT
                  GGtTCT
                  GGcTCT
                  *GGTCTG
                   GGaCTG
                   GGgCTG
                   GGcCTG
                   *GTCTGT
                    GTaTGT
                    GTtTGT
                    GTgTGT
                    *TCTGTC
                     TCaGTC
                     TCgGTC
                     TCcGTC
From this data, one can determine that the target does
not encode the signature sequence, but has the sequence:

*CGACGA             (SEQ ID NO:_)
    *GAtGGT         (SEQ ID NO:_)
       *GGTCTG      (SEQ ID NO:_)
        *GTCTGT     (SEQ ID NO:_)
         *TCTGTC    (SEQ ID NO:_)
 CGACGAtGGTCTGTC    (SEQ ID NO:_),

which encodes R R W S V.
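
The standard tiling example can be reproduced with a short Python sketch. It assumes, as in the listing above, 6-mer probes with the interrogation position at the third base, and it models hybridization as an exact match between a probe and the aligned window of the target. The function name `call_bases` and the exact-match rule are assumptions for illustration, not the algorithm of any particular software package.

```python
BASES = "ACGT"


def call_bases(target: str, reference: str,
               length: int = 6, interrogation: int = 2) -> dict:
    """Map reference positions to the base called from perfect-match probes.

    For each window of the reference, four probes are formed by placing
    each base at the interrogation position. A probe "hybridizes" (in this
    simplified model) only if it equals the aligned window of the target.
    """
    target, reference = target.upper(), reference.upper()
    calls = {}
    for start in range(len(reference) - length + 1):
        wild = reference[start:start + length]
        window = target[start:start + length]
        for base in BASES:
            probe = wild[:interrogation] + base + wild[interrogation + 1:]
            if probe == window:
                calls[start + interrogation] = base
    return calls


reference = "CGACGAGGGTCTGTC"
target = "CGACGAtGGTCTGTC"
calls = call_bases(target, reference)
variants = {pos: (reference[pos], base)
            for pos, base in calls.items() if reference[pos] != base}
print(calls)     # {2: 'A', 6: 'T', 9: 'T', 10: 'C', 11: 'T'}
print(variants)  # {6: ('G', 'T')}
```

Five probes match, corresponding to the five asterisked probes, and the only called position that differs from the reference is the G to T change at zero-based position 6. Windows that overlap the mutation at a non-interrogation position give no perfect match, which is why several positions are left uncalled.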



Software such as the GeneChip® software from Affymetrix, Inc.
(Santa Clara, CA) can be used to process the hybridization
data. See European Patent Application Publication No. EP
0717113.

2. Screening For Members Of A Gene Family
In another aspect, this invention provides methods 
for determining whether a target nucleic acid molecule encodes 
a member of a gene family. This method is useful for 
determining whether a target molecule is a known member of a 
family, or a new, previously unknown, member. In selecting 
arrays for this type of screening, several parameters can be 
varied. 

One parameter is the number of gene family members 
whose sequences are used on the array. Probe sets defining 
sequences from at least one and more preferably at least two 
members of the family are used on the array. However, for the
identification of new family members, one preferably creates
arrays containing probe sets defining sequences from all known
members of the family.

Another parameter that can be varied is the number 
of sequence signatures from each member of the gene family 
that are defined by probe sets on the array. A comparison of 
the amino acid and nucleotide sequences of known members of a 
gene family reveals both highly conserved and variable 
sequence regions. Conserved regions, because they share a 
higher degree of identity between members, are more useful for 
determining whether a target encodes a member of the family. 
Variable regions, because they are the most distinct, are more 
useful for discriminating between members of the family and 
for indicating whether a target encodes a new member of the 
family. Accordingly, arrays used for screening members of a 
family contain probe sets defining at least one sequence 
signature from each member of the gene family. 

Another parameter that can be varied, related to the 
second parameter, is the number of nucleotide sequences within 
a degenerate set encoding an amino acid signature sequence 
from one or more of the gene family members from which probe 
sets are chosen. For example, a nucleic acid signature 
sequence from a member of a gene family, if it is within the 
coding region of the gene, encodes an amino acid sequence. 
Probe sets can be selected that define not only the reference 
nucleotide sequence, but members of the degenerate set that 
encode the same amino acid sequence as the reference
nucleotide sequence. Such probe sets are useful in
identifying polymorphisms of any gene family member, as well
as new members of the family. Generic bases and probes having
mixtures of bases at certain codon positions, such as the
third codon position, can also be employed.

Another parameter is the length of the sequence
signature. While there is no particular size limit,
generally, sequence signatures are preferably at least 15
nucleotides long. A collection of sequence signatures
totalling between 75 and 125 nucleotides spread among about 4
signatures is particularly useful.

Any nucleic acid molecule can be used as a target
molecule in this method. However, often, the target is a
molecule that has been pre-screened in accordance with the
teachings of the present invention so that there is reason to
believe the target may be a member of the gene family. For
example, one may screen a DNA library with probes (which can
include degenerate sets, generic bases, and mixtures of
nucleotides at certain positions) having a sequence selected
from one or more members of the gene family. Depending upon
the stringency of the hybridization conditions used, the probe
may hybridize to sequences more closely or more distantly
related to the probe. Thus, the target sequence can be one
that hybridizes under a selected set of hybridization
conditions to a probe having the reference sequence.

The hybridization data generated from a
hybridization reaction between the target and the probes on
the array is processed to determine whether the data is
consistent with the target nucleic acid being a member of the
gene family. This can involve, for example, base calling the
target sequence over at least a sequence signature for a
conserved region of the gene or the determination of whether
the overall pattern expected for that sequence signature is
present.

The hybridization data may indicate that the target
molecule has sequences that are identical to that of a known
member of the gene family. However, if the hybridization data
indicates that there are differences between the target
sequence and the reference sequences, the extent of the
differences provides further information about the identity of
the target sequence.

For example, if the differences are few enough,
their location and identity can in certain embodiments be
determined by base calling using, e.g., arrays that employ
single increment, standard tiling. In this case, the
information is consistent with the target being one of the
known gene family members, possibly including allelic forms of
the gene. If there are significant differences between the
target and the probe sets, then the hybridization is generally
weak in the regions that differ. In this case, the
target is identified as containing an insert that is not a
previously known member of the family. The practitioner then
can decide whether the clone is worth sequencing to determine
if it is actually a member of the family, and, if so, how it
differs from the other members.

3. Determining Gene Sequences
As we move into a world in which all the genes of
the human and other genomes are identified and sequenced, the
focus of much nucleic acid analysis will be the identification
of which genes are present in a particular sample. Such
identification is particularly useful in the hierarchical
methods of the present invention. Accordingly, this invention
also provides methods of determining whether a target nucleic
acid molecule has a nucleotide sequence from any of a set of
genes. The methods involve providing an array with probe sets
defining sequence signatures from the gene set. Hybridization
data is collected from a hybridization reaction between the
target and the probes on the array. The data is analyzed to
determine whether the target contains the sequence signature
from one of the genes in the set.

The hybridization data can be processed in the
following manner. The extent of hybridization between the
probes that define each sequence signature and the target can
be determined. If the target has a sequence closely related
to the sequence signature of a gene, the degree of
hybridization between the target and the probes defining the
signature of that gene is strong compared to the
hybridization with the other sequence signatures defined in
the array. [...] Methods of analyzing hybridization data from
nucleic acid arrays are taught by EP publication [...].

Preferably, the sequence signatures are unique to the
genes in the set. [...] nucleotides suffices in most cases to
[...] gene. [...] for example, in determining the identity of
target cDNA molecules [...].

Further information about a target sequence can be obtained
by using arrays with probe sets in single-increment [...] each
nucleotide in each sequence signature. In this [...] base
calling. Alternatively, recognition of the hybridization
pattern is employed.
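
One way to picture the comparison of hybridization strength across sequence signatures is the following Python sketch. It is an assumption-laden illustration: the gene names, the intensity values, the use of the mean probe intensity as the score, and the `min_ratio` cut-off are all hypothetical choices and do not come from the specification.

```python
from statistics import mean
from typing import Dict, List, Optional


def rank_genes(signal_by_gene: Dict[str, List[float]]) -> list:
    """Sort genes by mean probe intensity of their signature, strongest first."""
    scores = {gene: mean(values) for gene, values in signal_by_gene.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def best_gene(signal_by_gene: Dict[str, List[float]],
              min_ratio: float = 2.0) -> Optional[str]:
    """Return the gene whose signature clearly dominates, or None if unclear."""
    ranked = rank_genes(signal_by_gene)
    top_gene, top_score = ranked[0]
    if top_score <= 0:
        return None
    if len(ranked) == 1:
        return top_gene
    runner_up = ranked[1][1]
    if runner_up <= 0 or top_score / runner_up >= min_ratio:
        return top_gene
    return None


# Hypothetical intensities for the probe sets of three gene signatures.
signals = {
    "geneA": [900.0, 850.0, 920.0],
    "geneB": [120.0, 90.0, 150.0],
    "geneC": [60.0, 80.0, 70.0],
}
identified = best_gene(signals)
print(identified)  # geneA
```

Returning `None` when no signature dominates leaves room for the follow-up steps the text describes, such as base calling over the signature or pattern recognition.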

B. Performing Hybridization Assays

Hybridization assays on substrate-bound polynucleotide arrays
involve a hybridization step and a detection step. In the
hybridization step, a hybridization mixture containing the
target and, preferably, a hybridization optimizing agent, such
as an isostabilizing agent, denaturing agent, or hybridization
accelerant, is brought into contact with the probes of the
array and incubated for a time appropriate to allow
hybridization between the target and any complementary probes.
Unbound target molecules are then removed from the array by
washing with a mixture that does not contain the target, such
as hybridization buffer. This leaves only bound target
molecules. In the detection step, the probes to which the
target has hybridized are identified. Since the nucleotide 
sequence of the probes at each feature is known, identifying 
the locations at which target has bound provides information 
about the particular sequences of these probes. 

The hybridization mixture includes the target 
nucleic acid molecule and a hybridization optimizing agent in 
an appropriate solution, i.e., a hybridization buffer. The 
target nucleic acid molecule is present in the mixture at a 
concentration between about 0.005 nM and about 50 nM,
preferably between about 0.5 nM and 5 nM or, more preferably,
about 1 nM and 2 nM. The target nucleic acid molecule
preferably includes a detectable label, such as a fluorescent
label.

Betaines and lower tetraalkyl ammonium salts are 
examples of isostabilizing agents. Denaturing agents are 
compositions that lower the melting temperature of double 
stranded nucleic acid molecules by interfering with hydrogen 
bonding between bases in a double- stranded nucleic acid or the 
hydration of nucleic acid molecules. Denaturing agents 
include formamide, formaldehyde, DMSO ("dimethyl sulfoxide"),
tetraethyl acetate, urea, GuSCN, glycerol and chaotropic
salts. Hybridization accelerants include heterogeneous nuclear
ribonucleoprotein ("hnRP") A1 and cationic detergents such as,
preferably, CTAB ("cetyltrimethylammonium bromide") and DTAB
("dodecyl trimethylammonium bromide"), and, also, polylysine,
spermine, spermidine, single stranded binding protein ("SSB"),
phage T4 gene 32 protein and a mixture of ammonium acetate and
ethanol.

The hybridization mixture is placed in contact with the array
and incubated. Contact can take place in any suitable
container, for example, a dish or a cell specially designed to
hold the array and to allow introduction of the fluid into and
removal of it from the cell so as to contact the array.
Generally, incubation will be at temperatures normally used
for hybridization of nucleic acids, for example, between about
20° C and about 75° C, e.g., about 25° C, about 30° C, about
35° C, about 40° C, about 45° C, about 50° C, about 55° C,
about 60° C or about 65° C. For probes longer
than about 14 nucleotides, 20° C - 50° C is preferred. For
shorter probes, lower temperatures are preferred. The
target is incubated with the probe array for a time sufficient 
to allow the desired level of hybridization between the target 
and any complementary probes in the array. Using a 
hybridization temperature of 25° C can yield a very clear 
signal, usually in at least 30 minutes to two hours, but it 
may be desirable to hybridize longer, i.e., about 15 hours. 

After incubation with the hybridization mixture, the 
array usually is washed with the hybridization buffer, which 
also can include the hybridization optimizing agent. These 
agents can be included in the same range of amounts as for the 
hybridization step, or they can be eliminated altogether. 
Then the array can be examined to identify the probes to which 
the target has hybridized. 

C. Preparation of Target Samples 

The target polynucleotide whose sequence is to be 
determined can be isolated from a clone, a cDNA, genomic DNA, 
RNA, cultured cells, or a tissue sample. If the target is 
genomic, the sample may be from any tissue (except exclusively
red blood cells). For example, whole blood, peripheral blood
lymphocytes or PBMC, skin, hair or semen are convenient 
sources of clinical samples. These sources are also suitable 
if the target is RNA. Blood and other body fluids are also a 
convenient source for isolating viral nucleic acids. If the 
target is mRNA, the sample is obtained from a tissue in which 
the mRNA is expressed. If the polynucleotide in the sample is 
RNA, it is usually reverse transcribed to DNA. DNA samples or 
cDNA resulting from reverse transcription are usually 
amplified, e.g., by PCR. Depending on the selection of 
primers and amplifying enzyme(s), the amplification product
can be RNA or DNA. Paired primers are selected to flank the 
borders of a target polynucleotide of interest. More than one 
target can be simultaneously amplified by multiplex PCR in 
which multiple paired primers are employed. If the target is 
a long polynucleotide, it may be appropriate to fragment the
target into smaller pieces before performing the hybridization
reaction. As used herein, the detection of hybridization 
between a target and probes on an array includes performing 
the hybridization reaction with all or portions of the target. 

The target can be labelled at one or more 
nucleotides during or after amplification. For some target 
polynucleotides (depending on size of sample), e.g., episomal 
DNA, sufficient DNA is present in the tissue sample to 
dispense with the amplification step. Preferred labels 
include fluorescent labels, chemi-luminescent labels,
bio-luminescent labels, and colorimetric labels, among others.
Most preferably, the label is a fluorescent label such as a
fluorescein, a rhodamine, a polymethine dye derivative, a
phosphor, and so forth. Commercially available fluorescent
labels include, inter alia, fluorescein phosphoramidites such
as Fluoreprime (Pharmacia, Piscataway, NJ), Fluoredite
(Millipore, Bedford, MA) and FAM (ABI, Foster City, CA).

Useful light scattering labels include large 
colloids, and especially the metal colloids such as those from 
gold, selenium, silver, tin, and titanium oxide. 

Radioactive labels include, for example, 32P. This
label can be detected by a phosphoimager. Detection, of
course, depends on the resolution of the imager.
Phosphoimagers are available having resolution of 50 microns.
Accordingly, this label is currently useful with chips having
features of at least that size.

In one embodiment, biotinylated bases are 
incorporated into the target nucleic acid. Hybridization is 
detected by staining with streptavidin-phycoerythrin. 

When the target strand is prepared in single-stranded
form, as in preparation of target RNA, the sense of
the strand should of course be complementary to that of the
probes on the chip. This is achieved, for example, by
appropriate selection of primers used for any amplification of
the target. Also, the array can contain probes for both
strands.

The target is preferably fragmented before
application to the chip to reduce or eliminate the formation
of secondary structures in the target and reduce any overhang
interaction. The average size of target segments following
hybridization is usually larger than the size of the probes on
the chip.

D. Substrate-Associated Polynucleotide Arrays 

Substrate-associated polynucleotide arrays used in 
the assays of this invention typically include between about 5
x 10^2 and about 10^8 features per square centimeter, or between
about 10^4 and about 10^7, or between about 10^5 and 10^6.

Preferably, the arrays are produced through
spatially directed polynucleotide synthesis. As used herein,
"spatially directed polynucleotide synthesis" refers to any
method of directing the synthesis of a polynucleotide to a
specific location on a substrate. Methods for spatially
directed polynucleotide synthesis include, without limitation,
light-directed polynucleotide synthesis, microlithography,
application by ink jet, microchannel deposition to specific
locations and sequestration with physical barriers. In
general these methods involve generating active sites, usually
by removing protective groups; and coupling to the active site
a nucleotide which, itself, optionally has a protected active
site if further nucleotide coupling is desired.

In one embodiment substrate-bound polynucleotide
arrays are synthesized at specific locations by light-directed
polynucleotide synthesis. The pioneering techniques of this
method are disclosed in U.S. Patent No. 5,143,854; PCT WO
92/10092; PCT WO 90/15070; and United States Application
Serial Nos. 08/249,188, filed May 24, 1994, 07/624,120, filed
December 6, 1990, and 08/082,937, filed June 25, 1993. In a
basic strategy of this process, the surface of a solid support
modified with linkers and photolabile protecting groups is
illuminated through a photolithographic mask, yielding
reactive hydroxyl groups in the illuminated regions. A
3'-O-phosphoramidite-activated deoxynucleoside (protected at the
5'-hydroxyl with a photolabile group) is then presented to the
surface and coupling occurs at sites that were exposed to
light. Following the optional capping of unreacted active
sites and oxidation, the substrate is rinsed and the surface
is illuminated through a second mask, to expose additional
hydroxyl groups for coupling to the linker. A second 5'-
protected, 3'-O-phosphoramidite-activated deoxynucleoside
is presented to the surface. The selective
photodeprotection and coupling cycles are repeated until the
desired set of products is obtained. Photolabile groups are
then optionally removed and the sequence is, thereafter,
optionally capped. Side chain protective groups, if present,
are also removed. Since photolithography is used, the process
can be miniaturized to generate high-density arrays of
polynucleotide probes.

[...] the nucleotides can be natural nucleotides, chemically
modified nucleotides or nucleotide analogs, as long as they
have activated hydroxyl groups compatible with the linking
chemistry. The protective groups can, themselves, be
photolabile. Alternatively, the protective groups can be
labile under certain chemical conditions, e.g., acid. In
this example, the surface of the solid support can contain a
composition that generates acids upon exposure to light.
Thus, exposure of a region of the substrate to light generates
acids in that region that remove the protective groups in the
exposed region. Also, the synthesis method can use 3'-
protected 5'-O-phosphoramidite-activated deoxynucleoside. In
this case, the polynucleotide is synthesized in the 5' to 3'
direction, which results in a free 3' end.

The general process of removing protective groups by
exposure to light, coupling nucleotides (optionally competent
for further coupling) to the exposed active sites, and
optionally capping unreacted sites is referred to herein
as "light-directed nucleotide coupling."

[...] adapted for various tasks, such as [...], filed October
26, 1994.
If desired, the substrate-bound polynucleotide array
can be appropriately packaged for use in a chip reader. One 
such apparatus is disclosed in International Publication No. 
WO 95/33846. 

Probes may be laid out on a polynucleotide array
with a specifically defined positional relationship. For
example, the probes in the set can be positioned in adjacent
features on the array. However, hybridization data from a
polynucleotide array normally will be processed by a
programmable digital computer. The computer memory can be
programmed to remember the sequence of each probe at each
feature on the array. Consequently, one may provide a
polynucleotide array or set of polynucleotide arrays containing
all possible sequences of probes of a given length. For
example, a chip having 525 by 525, or 275,625, features can
contain all nine-mer probes having all possible nucleotide
sequences of 9 nucleotides (4^9 = 262,144). Using any selected
tiling strategy, the programmable computer can identify the
set of features containing probes that define any given
reference sequence. Then, the computer can be programmed to
process hybridization data from the probe set that defines a
reference sequence.
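
The arithmetic in this paragraph, and the idea of looking up the features that define a reference sequence, can be sketched in Python. The layout used here is purely an assumption for illustration: each nine-mer is given a base-4 index and placed row by row on a 525 x 525 grid. The specification does not prescribe any particular layout, and the names `probe_index`, `feature_of`, and `features_for_reference` are hypothetical.

```python
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}
SIDE = 525          # features per side of the chip
PROBE_LENGTH = 9    # nine-mer probes

n_probes = 4 ** PROBE_LENGTH   # 262,144 possible nine-mers
n_features = SIDE * SIDE       # 275,625 features on the chip


def probe_index(probe: str) -> int:
    """Read the probe as a base-4 number (A=0, C=1, G=2, T=3)."""
    index = 0
    for base in probe.upper():
        index = index * 4 + BASE_INDEX[base]
    return index


def feature_of(probe: str, side: int = SIDE) -> tuple:
    """(row, column) of the probe under the assumed row-by-row layout."""
    return divmod(probe_index(probe), side)


def features_for_reference(reference: str,
                           length: int = PROBE_LENGTH) -> list:
    """Features holding the single-increment tiling of a reference sequence."""
    return [feature_of(reference[i:i + length])
            for i in range(len(reference) - length + 1)]


print(n_probes, n_features, n_features - n_probes)  # 262144 275625 13481
print(feature_of("TTTTTTTTT"))                      # (499, 168)
print(len(features_for_reference("CGACGAGGGTCTGTC")))  # 7
```

Because the probe at every feature is known from the layout, no physical rearrangement of the chip is needed to analyze a new reference sequence; only the set of features consulted changes.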

E. Detecting Fluorescently Labelled Probes

Determining a signal generated from a detectable
label on an array requires a polynucleotide array or chip
reader. The nature of the polynucleotide array reader depends
upon the particular type of label attached to the target
molecules.

In one embodiment the chip reader comprises a body
for immobilizing the polynucleotide array. Excitation
radiation, from an excitation source having a first
wavelength, passes through excitation optics from below the
array. The excitation optics cause the excitation radiation
to excite a region of a polynucleotide array on the
substrate. In response, labeled material on the sample emits
radiation which has a wavelength that is different from the
excitation wavelength. Collection optics, also below the
array, then collect the emission from the sample and image it
onto a detector. The detector generates a signal proportional
to the amount of radiation sensed thereon. The signals can be
assembled to represent an image associated with the plurality
of regions from which the emission originated.

According to one embodiment, a multi-axis
translation stage moves the polynucleotide array [...]
position [...] locations of a polynucleotide array [...]. As
a result, a 2-dimensional image of the polynucleotide array is
obtained.

The polynucleotide array reader can include an auto-
focusing feature to maintain the sample in the focal plane of
the excitation light throughout the scanning process.
Further, a temperature controller may be employed to maintain
the sample at a specific temperature while it is being
scanned. The multi-axis translation stage, temperature
controller, auto-focusing feature, and electronics associated
with [...] are managed by an appropriately programmed digital
computer.

In one embodiment, a beam is focused onto a spot of
about 2 µm in diameter on the surface of the array using, for
example, the objective lens of a microscope or other optical
means to control beam diameter. (See, e.g., United States
patent application 08/195,889, filed February 10, 1994).

In another embodiment, fluorescent probes are
employed in combination with CCD imaging systems. Details of
this method are described in United States Application Serial
number 08/301,051, filed [...]. In [...] commercially
available [...], typically the light source is placed above an
array, and a photodiode detector is below the array. For the
present methods, the light source can be replaced with a
higher power [...]. In one embodiment, the standard
absorption geometry is used, but the photodiode detector is
replaced with a CCD camera and imaging optics to allow [...].
[...] in the optical path [...] while allowing the emission
to pass to the detector. In a variation of this method, a
fiber optic imaging bundle is utilized to bring the light to
the CCD detector. In another embodiment, the laser is placed
below the polynucleotide array and light directed through the
transparent wafer or base that forms the bottom of the
polynucleotide array. In another embodiment, the CCD array is
built into the wafer of the polynucleotide array.

The choice of the CCD array will depend on the
number of polynucleotides in each array. If 2500 features of
sequence-specific polynucleotides nominally arranged in a
square (50 x 50) are examined, and 6 lines in each feature are
sampled to obtain a good image, then a CCD array of 300 x 300
pixels is desirable in this area. However, if an individual
array has 48,400 features (220 x 220) then a CCD array with
1320 x 1320 pixels is desirable. CCD detectors are
commercially available from, e.g., Princeton Instruments,
which can meet either of these requirements.
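
The pixel counts quoted above follow from multiplying the number of features along one side of a square array by the number of lines sampled per feature. A minimal Python check, with a hypothetical helper name `ccd_side_pixels`:

```python
import math


def ccd_side_pixels(n_features: int, lines_per_feature: int = 6) -> int:
    """Pixels needed along one side of the CCD for a square feature array."""
    side = math.isqrt(n_features)
    if side * side != n_features:
        raise ValueError("n_features must be a perfect square")
    return side * lines_per_feature


small = ccd_side_pixels(2500)    # 50 features per side x 6 lines
large = ccd_side_pixels(48400)   # 220 features per side x 6 lines
print(small, large)  # 300 1320
```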

The detection device also can include a line
scanner, as described in United States patent application
08/301,051, filed September 2, 1994. Excitation optics
focuses excitation light to a line at a sample, simultaneously
scanning or imaging a strip of the sample. Surface-bound
fluorescent labels from the array fluoresce in response to the
light. Collection optics image the emission onto a linear
array of light detectors. By employing confocal techniques,
substantially only emission from the light's focal plane is
imaged. Once a strip has been scanned, the data representing
the 1-dimensional image are stored in the memory of a
computer. According to one embodiment, a multi-axis
translation stage moves the device at a constant velocity to
continuously integrate and process data. Alternatively,
galvanometric scanners or rotating polyhedral mirrors may be
employed to scan the excitation light across the sample. As a
result, a 2-dimensional image of the sample is obtained.

In another embodiment, collection optics direct the
emission to a spectrograph which images an emission spectrum
onto a 2-dimensional array of light detectors. By using a
spectrograph, a full spectrally resolved image of the array is
obtained.
The read time for a polynucleotide array will
depend on the photophysics of the fluorophore (i.e.,
fluorescence quantum yield and photodestruction yield) as well
as the sensitivity of the detector. For fluorescein,
sufficient signal-to-noise to read a chip image with a CCD
detector can be obtained in about 30 seconds using 3 mW/cm^2
and 488 nm excitation from an Ar ion laser or lamp. By
increasing the laser power, and switching to dyes such as CY3
or CY5 which have lower photodestruction yields and whose
emission more closely matches the sensitivity maximum of the
CCD detector, one easily is able to read each array in less
than 5 seconds.

F. Data Analysis 

Data generated in hybridization assays are most
easily analyzed with the use of a programmable digital
computer. The computer program generally is stored on a
readable medium that holds the code. Certain files are devoted
to memory that includes the location of each feature and the
sequence of the polynucleotide probe at that feature. Because
analysis often involves comparing the sequence of a target to
a reference sequence, the program also can include in its
memory the reference sequence. Using this information, the
program can then identify the set of features on the array
whose probes define the reference sequence in the selected
tiling strategy. The program also contains code that
receives as input hybridization data from a hybridization
reaction between a target nucleic acid molecule and
polynucleotide probes in the polynucleotide array, and code
that processes the hybridization data. The program also can
include code that receives instructions from a programmer as input.
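By way of illustration only, the bookkeeping described above can be sketched in Python. The sketch builds a table of feature locations and probe sequences for a simple 4-base tiling of a reference sequence, then looks up the features whose probes define the reference. The probe length, the central interrogation position, the row/column layout, and the function names are assumptions made for this example and are not taken from any particular analysis software. For simplicity, probes are written in the same sense as the reference rather than as its complement.

```python
from typing import Dict, List, Tuple

BASES = "ACGT"  # row order on the hypothetical array


def build_tiling(reference: str, probe_len: int = 5) -> Dict[Tuple[int, int], str]:
    """Map feature location (row, column) to probe sequence.

    Each column interrogates one reference position; each of the four
    rows carries a probe with a different base at the central position.
    """
    half = probe_len // 2
    features: Dict[Tuple[int, int], str] = {}
    for col, pos in enumerate(range(half, len(reference) - half)):
        window = reference[pos - half: pos + half + 1]
        for row, base in enumerate(BASES):
            features[(row, col)] = window[:half] + base + window[half + 1:]
    return features


def reference_features(reference: str,
                       features: Dict[Tuple[int, int], str],
                       probe_len: int = 5) -> List[Tuple[int, int]]:
    """Return the locations of features whose probes match the reference."""
    half = probe_len // 2
    hits = []
    for col, pos in enumerate(range(half, len(reference) - half)):
        window = reference[pos - half: pos + half + 1]
        for row in range(len(BASES)):
            if features[(row, col)] == window:
                hits.append((row, col))
    return hits
```

With a toy reference such as "ACGTACG" and 5-base probes, three positions are interrogated, giving twelve features, of which one per column matches the reference exactly.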

The computer can transform the data into another
format for presentation. Data analysis can include the steps
of determining, e.g., fluorescent intensity as a function of
substrate position from the data collected, removing
"outliers" (data deviating from a predetermined statistical
distribution), and calculating the relative binding affinity
of the targets from the remaining data. The resulting data
can be displayed as an image with color in each region varying
according to the light emission or binding affinity between
targets and probes therein.
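The analysis steps just listed can be sketched as follows, purely as an example. The sketch assumes that each feature is represented by a list of pixel intensities, that an "outlier" is a pixel more than a chosen number of standard deviations from the feature mean, and that relative binding affinity is approximated by scaling each feature's trimmed mean to the brightest feature. These rules and the function names are illustrative assumptions; the text does not prescribe a particular statistical test.

```python
from statistics import mean, pstdev


def trimmed_intensity(pixels, n_sigma=2.0):
    """Mean pixel intensity after discarding values more than n_sigma
    standard deviations from the mean (one simple outlier rule)."""
    mu, sigma = mean(pixels), pstdev(pixels)
    kept = [p for p in pixels if sigma == 0 or abs(p - mu) <= n_sigma * sigma]
    return mean(kept)


def relative_affinity(feature_pixels, n_sigma=2.0):
    """feature_pixels maps a feature location to its list of pixel values.

    Returns a map from location to intensity scaled so that the
    brightest feature has the value 1.0.
    """
    intensities = {loc: trimmed_intensity(px, n_sigma)
                   for loc, px in feature_pixels.items()}
    top = max(intensities.values())
    return {loc: (value / top if top else 0.0)
            for loc, value in intensities.items()}
```

The scaled values could then be mapped to a color scale to produce the image display described above.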

When the detection step involves hybridization of a labeled
target polynucleotide with a polynucleotide in the array, one
application of this system, when coupled with the CCD imaging
system, speeds performance by obtaining results of the assay
from the on- or off-rates of the hybridization. In one version
of this method, the amount of
binding at each address is determined at several time points
after the targets are contacted with the array. The amount of
total hybridization can be determined as a function of the
kinetics of binding based on the amount of binding at each
time point. Thus, it is not necessary to wait for equilibrium
to be reached. The dependence of the hybridization rate for
different polynucleotides on temperature, sample agitation, and
washing conditions (e.g., pH, solvent characteristics,
temperature) can easily be determined in order to optimize the
conditions for rate and signal-to-noise. Alternative methods
are described in Fodor et al., United States patent 5,324,633,
incorporated herein by reference.
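As one hedged example of how total hybridization might be estimated before equilibrium, the sketch below assumes that the signal at an address approaches its plateau exponentially, B(t) = B_eq (1 - exp(-kt)), and that three readings are taken at equal time intervals. Under that assumption the ratio of successive increments equals exp(-k dt), which gives the plateau and rate constant in closed form. The kinetic model and the function name are assumptions for illustration; the text does not specify a model.

```python
import math


def extrapolate_equilibrium(b1, b2, b3, dt):
    """Estimate plateau signal and rate constant from three readings
    taken at equal intervals dt, assuming an exponential approach
    to equilibrium."""
    d1, d2 = b2 - b1, b3 - b2
    if d1 <= 0 or d2 <= 0 or d2 >= d1:
        raise ValueError("readings must rise and level off toward a plateau")
    ratio = d2 / d1                 # equals exp(-k * dt) under the model
    b_eq = b1 + d1 / (1.0 - ratio)  # extrapolated equilibrium signal
    k = -math.log(ratio) / dt       # apparent rate constant
    return b_eq, k
```

Noisy data would call for more time points and a least-squares fit; the three-point form is shown only because it makes the reasoning explicit.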

G. Mechanics of Assays 

Assays on polynucleotide arrays generally include
contacting a polynucleotide array with a labeled sample
under the selected reaction conditions, optionally washing the
array to remove unreacted molecules, and analyzing the
biological array for evidence of reaction between target
molecules and the probes. These steps involve handling fluids.
These steps can be automated using automated fluid-handling
systems for concurrently performing the detection steps on the
array. Fluid handling allows uniform treatment of samples in
the wells. Microtiter robotic and fluid-handling devices are
available commercially, for example, from Tecan AG.

The chip can be manipulated by a fluid-handling
device. This robotic device can be programmed to set
appropriate reaction conditions, such as temperature, add
reagents to the chip, incubate the chip for an appropriate
time, remove unreacted material, wash the chip substrate, add
reaction substrates as appropriate and perform detection
assays. The particulars of the reaction conditions chosen
depend upon the purpose of the assay, for example
hybridization of a probe or attachment of a label to
polynucleotides. The chip can be appropriately packaged
for use in a chip reader. One such apparatus is disclosed in
United States patent application 08/255,682, filed June 8,
1994.

H. Substrates of the Polynucleotide Array

In making a chip, the substrate and its surface
preferably form a rigid support on which the sample can be
formed. The substrate and its surface are also chosen to
provide appropriate light-absorbing characteristics. For
instance, the substrate may be functionalized glass, Si, Ge,
GaAs, GaP, SiO2, SiN4, modified silicon, or any one of a wide
variety of gels or polymers such as (poly)tetrafluoroethylene,
(poly)vinylidenedifluoride, polystyrene, polycarbonate, or
combinations thereof. Other substrate materials will
be readily apparent to those skilled in the art upon review of
this disclosure. In a preferred embodiment the substrate is
flat glass or silica.

Surfaces on the solid substrate usually, though not
always, are composed of the same material as the substrate.
Thus the surface may be composed of any of a wide variety of
materials, for example, polymers, plastics, resins,
polysaccharides, silica or silica-based materials, carbon,
metals, inorganic glasses, membranes, or any of the above-
listed substrate materials. In one embodiment, the surface
will be optically transparent and will have surface Si-OH
functionalities, such as those found on silica surfaces.

Preferably, polynucleotides are arrayed on a chip in
addressable rows and columns. Technologies already have been
developed to read information from such arrays. The amount of
information that can be stored on each chip depends on the
lithographic density which is used to synthesize the wafer.
For example, if each feature size is about 100 microns on a
side, each chip can have about 10,000 probe addresses
(features) in a 1 cm² area.
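The density figure above follows from simple arithmetic, shown here as a short Python sketch. It assumes square features packed edge to edge on a square region with no gaps between them; the function name is chosen for this example only.

```python
def probe_addresses(chip_side_cm: float, feature_side_um: float) -> int:
    """Number of square features that fit on a square chip region,
    assuming features are packed edge to edge with no gaps."""
    chip_side_um = chip_side_cm * 10_000          # 1 cm = 10,000 microns
    per_side = int(chip_side_um // feature_side_um)
    return per_side * per_side
```

For a 1 cm side and 100 micron features this gives 100 features per side, or 10,000 addresses; halving the feature side to 50 microns would quadruple the count under the same assumption.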

The following examples are offered by way of 
illustration, not by way of limitation. 

EXAMPLE 

The method of the invention was used to screen for
new members of the TGF-β superfamily of proteins. There are
currently 32 known members of this family. Clone libraries
were created from genomic material based on hybridization to
nucleic acid probes in solution that contain sequences
complementary to sequence motifs that are indicative of
members of this gene family. The genomic inserts were
approximately 15 kb in size. Most of the inserts contain
sequences from previously known members of the family.

Conventional approaches involve sequencing these 15
kb inserts over and over, most of the time only to find that
the insert contains a family member that has already been
identified. The method of this invention replaced those
laborious and time-consuming steps with a faster, easier
screening method that can identify which clones contain known
members of the family, and which few clones out of the large
library are worth investigating in greater detail.



TGF-β clone screening polynucleotide array:

The array contained over 12,000 features with
different probes with single-increment, 4-base trellis tilings
for 99 bases for each of the 32 known members of the TGF-β
family (see Fig. 5). The 99 bases were from 4 different
regions of the genes and the contiguous regions range in size
from 18 to 30 bases. The interrogated regions were chosen
based on a few criteria: they include regions that are (a)
reasonably well conserved (highly conserved at the amino acid
level but less so at the DNA level) and that serve as
identifiers of the protein family, (b) highly variable and
serve as unique identifiers of individual members of the
family, and (c) not near expected intron/exon boundaries.
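The stated array size can be checked with a short calculation, given here as a sketch. It assumes that a 4-base tiling places four probes at each interrogated base, and the example region lengths (30, 27, 24 and 18 bases) are hypothetical values chosen only because they satisfy the stated constraints; the actual region lengths are not given in the text.

```python
def tiling_feature_count(n_members: int, bases_per_member: int,
                         probes_per_base: int = 4) -> int:
    """Features needed when every interrogated base gets probes_per_base probes."""
    return n_members * bases_per_member * probes_per_base


def valid_region_split(lengths, total=99, shortest=18, longest=30) -> bool:
    """Check that region lengths sum to the total and fall within the size range."""
    return sum(lengths) == total and all(shortest <= n <= longest for n in lengths)
```

Under these assumptions, 32 family members with 99 interrogated bases each require 32 x 99 x 4 = 12,672 features, which agrees with the figure of over 12,000 features.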

Preparing samples for hybridization:

Either DNA or RNA can be produced from a clone using
standard methods, e.g., nucleic acid extractions followed by
PCR or in vitro transcription, with labeled bases incorporated
during the polymerization step. Fragmented single-stranded
DNA or RNA can be used in the hybridization as well as
fragmented double-stranded DNA. The hybridizations are done
in either 6X SSPE-T or 3 M TMACl-T (buffered with Tris to avoid
having any Na ions in the hybridization solution), and
generally at temperatures above 30 °C to improve discrimination
and to reduce cross-hybridization (this is more important in
this application than for some re-sequencing applications
because the samples include ~15,000 bases). If labeled RNA is
used, samples are fragmented with heat in the presence of
Mg²⁺. If DNA is used, samples are fragmented by treatment
with DNase I prior to hybridization. This works with either
double-stranded DNA or DNA that is made single-stranded
following PCR by degradation of one of the strands using
lambda exonuclease.

Imaging and data analysis:

Following hybridization and reading of the arrays,
the images are analyzed using the TGF report GeneChip software
(Affymetrix, Inc., Santa Clara, CA, USA). Base calls were
made over all 99 bases for each of the 32 different regions.

The calls were compared with the sequences expected for each
of the 32 known wild type sequences (see Figs. 5 and 6). For
each, the results of the base calling were listed, and the
output was sorted based on the number of calls (# correct)
that match the expected sequence in that region. In all the
[...] each of the four different regions for the top [...]
in the list, giving a clearer picture of where the
differences occur. Result [...] hybridization [...] is shown in
Fig. 6.
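The comparison and sorting of base calls described in this analysis step can be illustrated with the following Python sketch. It is not the GeneChip software; the data layout (one string of calls and one expected string per family member), the use of "N" for an ambiguous call, and the function names are assumptions made for the example.

```python
def count_correct(calls: str, expected: str) -> int:
    """Number of positions at which the base call matches the expected base."""
    return sum(c == e for c, e in zip(calls, expected))


def rank_family_members(calls_by_member, expected_by_member):
    """Return (member, number correct, number of bases) rows,
    sorted so that the best-matching family member comes first."""
    report = []
    for member, expected in expected_by_member.items():
        calls = calls_by_member[member]
        report.append((member, count_correct(calls, expected), len(expected)))
    report.sort(key=lambda row: row[1], reverse=True)
    return report
```

A clone carrying a known family member would be expected to score at or near the full number of bases for that member, while a clone scoring poorly against every member would be a candidate for closer investigation.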

[...] disease is made by obtaining a sample
of bodily fluids, tissue, or other nucleic-acid-containing
material and determining whether a sequence signature present
in a possible pathogen or set [...]. For [...] sequence
signature of a Mycobacterium is [...]. If no [...] is present,
the presence of another suspected pathogen is explored.

The present invention provides a novel method for
performing assays on polynucleotide arrays. While specific
examples have been provided, the above description is
illustrative and not restrictive. Many variations of the
invention will become apparent to those skilled in the art
upon review of this specification. The scope of the invention
should, therefore, be determined not with reference to the
above description, but instead should be determined with
reference to the appended claims along with their full scope
of equivalents.

All publications and patent documents cited in this
application are incorporated by reference in their entirety
for all purposes to the same extent as if each individual
publication or patent document were so individually denoted.

WHAT IS CLAIMED IS:

1. A method for determining whether a target 
molecule has a sequence from a gene family member comprising: 

providing a polynucleotide array comprising,
for each of at least two different gene family members, a set 
of polynucleotide probes that define a reference nucleotide 
sequence from the gene family member; 

generating hybridization data by performing a 
hybridization reaction between the target nucleic acid 
molecule and the probes in the sets and detecting 
hybridization between the target nucleic acid molecule and 
each of the probes in the sets; and 

processing the hybridization data to determine 
whether the target nucleic acid has the reference sequence 
from one of the gene family members.

2. The method of claim 1 further comprising the 
step of selecting the target nucleic acid molecule by 
determining whether the target hybridizes to a nucleic acid 
probe that hybridizes to a gene encoding the gene family 
members.

3. The method of claim 1 wherein the step of
processing is performed by a programmable digital computer. 

4. The method of claim 1 wherein the 
polynucleotide array further comprises, for each of the gene 
family members, a probe set defining a highly conserved region 
of the gene and a probe set defining a highly variable region 
of the gene. 

5. The method of claim 1 wherein the 
polynucleotide array further comprises, for each of the gene 
family members, probe sets defining at least two highly 
conserved regions of the gene and probe sets defining at least 
two highly variable regions of the gene.

6. The method of claim 1 wherein the region codes
for amino acid sequence and the array further comprises probe
sets defining the degenerate set of nucleotide sequences
encoding the amino acid sequence.

7. The method of claim 1 further comprising the
step of determining the nucleotide sequence of the target
nucleic acid molecule if the target does not have the sequence
of the region of a gene family member.



8. A method of determining whether a nucleic acid
in a sample is a member of a gene family, comprising:

selecting a hierarchy of assay techniques
comprising at least a first and second assay, said first assay
being selected to provide a determination of a presence,
absence, or variant of a first sequence signature and said
second assay being selected to provide a determination of a
presence, absence or variant of a second sequence signature,
wherein at least one of said assays employs a high-density
nucleic acid array;

analyzing said nucleic acid sample using said
first assay; and

determining whether said nucleic acid is a
member of said gene family based on the results of said first
and second assays.

9. The method of claim 8, wherein said first
sequence signature is a highly conserved region of a gene
family.

10. The method of claim 8, wherein said second
sequence signature is a non-conserved region of a gene family.

11. The method of claim 8, further comprising
determining the full length sequence of said nucleic acid
sample.

12. The method of claim 8, wherein said gene family
is the TGF-beta family.

13. The method of claim 8, wherein said first or
second sequence signature is between 10 and [?]00 nucleotides in
length.

14. The method of claim 13, wherein said first or
second sequence signature is between 18 and 30 nucleotides in
length.

15. A method of selecting clones for analysis,
comprising:

providing a support having a variety of clones
associated therewith;

exposing said support to one or more
polynucleotides;

identifying clones that hybridize with
said polynucleotides; and

[...] identified in said identifying step for analysis.

16. The method of claim 15, wherein said support is
a high-density nucleic acid array.

17. A method of narrowing a sample for analysis,
comprising:

[...] a sample containing nucleic acids;

analyzing whether said sample contains a
sequence signature using a high-density nucleic acid array;

further analyzing said nucleic acid sample only
if said sequence signature is not present.

18. A high density nucleic acid array comprising
sequence signatures from the TGF-beta gene family.